Consent handling during data harvesting

ABSTRACT

The described technologies can be used for consent handling during data harvesting. In one example, a method can include receiving social media data associated with a user identifier and a first country code. A stored consent configuration rule can specify whether to store the social media data anonymously or non-anonymously. The consent configuration rule can be associated with a second country code. It can be determined whether the second country code associated with the consent configuration rule matches the first country code associated with the social media data. When the second country code associated with the consent configuration rule does not match the first country code associated with the social media data, the social media data can be stored in a quarantine.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of Indian Provisional Application No. 5334/CHE/2015, entitled “CONSENT HANDLING DURING DATA HARVESTING,” filed Oct. 6, 2015, the entire disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

A manufacturer or provider of services can potentially gain insights into consumer sentiment by monitoring user-generated data posted to social media platforms. For example, consumers may use various social media platforms to discuss products and services that the consumers have purchased or are considering purchasing. In particular, a customer may reveal his or her sentiments by posting information praising an innovative and well-designed product or criticizing a poorly designed or manufactured product. A business may enhance its insights by taking into account the sentiments of multiple consumers, such as by aggregating the sentiments of multiple consumers from one or more social media websites. For example, various statistical tools can be used to detect trends and/or distributions in sentiment. However, data privacy laws may limit the content and type of data that can be stored during data harvesting.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In one embodiment, a method can include receiving social media data associated with a user identifier and a first country code. A stored consent configuration rule can specify whether to store the social media data anonymously or non-anonymously. The consent configuration rule can be associated with a second country code. It can be determined whether the second country code associated with the consent configuration rule matches the first country code associated with the social media data. When the second country code associated with the consent configuration rule does not match the first country code associated with the social media data, the social media data can be stored in a quarantine. When the second country code associated with the consent configuration rule matches the first country code associated with the social media data, the social media data can be stored according to the consent configuration rule.

In one embodiment, a method can include harvesting social media data to generate harvested social media data. The harvested social media data can include a social media channel and a country code. A plurality of consent rules can be stored. A respective consent rule of the plurality of consent rules can specify a format for storing data associated with at least a respective country code or a respective social media channel. The method can include determining whether there is a matching consent rule of the plurality of consent rules corresponding to the social media channel or the country code of the harvested social media data. When there is a matching consent rule, the harvested social media data can be stored in the format specified by the matching consent rule of the plurality of consent rules. When there is no matching consent rule, the harvested social media data can be stored in a locked format so that user sentiment is not retrievable until a matching consent rule is created.

In one embodiment, a system can be used for consent handling during data harvesting. The system can include a data harvester for collecting social media data. The collected social media data can include one or more fields representative of a user identifier, a country code, or a social media channel. The system can include a first storage device for storing a plurality of consent rules. A respective consent rule can be for matching a value of a given field of the collected social media data to a format for storing the social media data. The system can include a non-quarantine storage device and a quarantine storage device for storing collected social media data. The quarantine storage device can be physically separate from the non-quarantine storage device. The system can include a database system in communication with the data harvester and the storage devices. The database system can be configured to determine whether any of the plurality of consent rules match the value of the given field of the collected social media data. When there is a match, the collected social media data can be stored on the non-quarantine storage device in the format of the matching consent rule. When there is no match, the collected social media data can be stored on the quarantine storage device.

The foregoing and other objects, features, and advantages of the invention will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an example system implementing consent handling during data harvesting.

FIG. 2 is an example dataflow diagram for consent handling during data harvesting.

FIG. 3 is a block diagram of an example of collected social media data.

FIGS. 4-5 are flow charts illustrating various example methods for managing user consent during data harvesting.

FIG. 6 is a flow chart illustrating an example method for processing a request for social media data.

FIG. 7 is a flow chart illustrating an example method for managing user consent during data harvesting.

FIG. 8 is a diagram of an example computing system in which described embodiments can be implemented.

DETAILED DESCRIPTION

Overview

Social media analytics can include analyzing user-generated data, such as social media data, to make business decisions. For example, a search term, such as a product name, can be specified and social media data containing the search term can be collected or harvested for further analysis. The social media data can be collected by crawling pages of various social media websites and/or by using a social media retrieval service, such as DATASIFT or GNIP. Social media platforms can include websites and/or applications that allow individuals and/or communities of individuals to create and share information. Social media websites can include FACEBOOK, TWITTER, TUMBLR, REDDIT, PINTEREST, FLICKR, GOOGLE+, INSTAGRAM, YOUTUBE, YELP, IMDB, LINKEDIN, TOPIX, and DAILYMOTION, for example. User-generated data can also be gathered from blogging sites and WIKIPEDIA, for example.

The collected user-generated data can be analyzed as it is collected and/or stored for analysis at a later time. However, the collection, storage, and/or analysis of user-generated data may be subject to different laws depending on where the data is generated, collected, and/or stored. For example, a first set of countries may not have any restrictions on using harvested data that contains personally-identifying information. A second set of countries may allow the use of harvested data where the user is anonymized or separated from the data to be analyzed. A third set of countries may require that a user's consent be obtained before using harvested data that contains personally-identifying information. The required consent can be obtained indirectly, such as via a licensing agreement of the social media site, or directly from the user that generated the data.

As described herein, compliance with different countries' laws when storing social media data can potentially be improved by storing consent configuration rules for multiple countries, determining a country associated with collected social media data, using the consent configuration rules to determine a storage format in compliance with the associated country's laws, and storing the social media data in the storage format specified by the consent configuration rules. The content of the stored social media data can then be analyzed by a downstream application, such as a social media analytics program, to discern user sentiments associated with the search term. However, the country associated with collected social media data may not match any of the stored consent configuration rules. In this case, the collected social media data can be quarantined from downstream applications that can analyze contents of the collected social media data. For example, the collected social media data can be physically isolated from the stored social media data associated with a country having a matching consent rule. As another example, the collected social media data can be stored in a locked format so that the collected social media data is not accessible to downstream applications. In other words, the downstream applications can be denied read permission for the quarantined social media data. If the consent configuration rules are updated to include a rule for the country associated with the quarantined social media data, the quarantined social media data can be reformatted and/or moved to a different storage location so that downstream applications can read the contents of the social media data.

Example System Implementing Consent Handling During Data Harvesting

FIG. 1 is a block diagram of an example system 100 implementing consent handling during data harvesting. The system 100 can include one or more server computers 110 for executing one or more data harvesting programs (e.g., data harvesters 111-113), storage for persisting consent rules 120, and storage 130 for persisting collected user-generated data, such as social media data. The storage 130 can include one or more physically distinct storage devices, where respective storage devices can provide different read, write, and/or modify permissions to different users, services, and/or applications. The storage 130 can include one or more quarantine storage devices and one or more non-quarantine storage devices. The non-quarantine storage devices can be used to store data that is accessible by various downstream applications. However, the quarantine storage devices may have limited access. For example, the access may be limited to only a data classifier 115 service or to an administrator of the system 100. The non-quarantine storage device(s) can include anonymized data 131 and non-anonymized data 132, for example. The quarantine storage device(s) can include the quarantined data 133, for example. The storage 120, 130 can include non-volatile memory, magnetic disks, direct-attached storage, network-attached storage (NAS), storage area networks (SAN), redundant arrays of independent disks (RAID), magnetic tapes or cassettes, DVDs, or any other medium which can be used to store information in a non-transitory way and which can be accessed by the server computers 110.

The data harvesters 111-113 can receive one or more search queries 140 and can collect user-generated data associated with the search queries 140. The search queries 140 can include one or more search terms and/or connectors, one or more information channels from which to gather data, a time or date range, an expiration date, and/or a ceiling on the amount of data to collect. The user-generated data can include posts to social media platforms, comments to articles posted at a network address, blog entries, or the like. The respective data harvesters 111-113 can be adapted to collect data from the one or more information channels. An information channel can be associated with one or more of a network address or web-site, a country, a range of network addresses, a web service, or the like.

As one example, the data harvester 111 can be adapted to collect user-generated data 152 from one or more server computers 150. In particular, the data harvester 111 can send Hyper-Text Transfer Protocol (HTTP) requests to an Internet Protocol (IP) address associated with the server computers 150 and the user-generated data 152 can be returned in HTTP responses. The data harvester 111 can process the collected user-generated data 152. For example, the data harvester 111 can filter the HTTP response data so that only information associated with a given search term is retained. As another example, the user-generated data 152 can be unstructured data and the data harvester 111 can format the unstructured data into different fields. The data harvester 111 can annotate the collected user-generated data 152 with additional fields to indicate various aspects associated with the collection of the data. For example, fields can be added to indicate: a country code associated with the country where the user-generated data 152 was collected from; a channel identifier to indicate a web-site where the user-generated data 152 was collected from; a retrieval mode identifier to indicate which data harvester collected the user-generated data 152; a time-stamp to indicate when the data was collected and/or posted; or the like. Thus, the data harvester 111 can harvest unstructured data and reformat the data into a unified structured format. A unified structured format can include a plurality of predefined fields so that all collected data can have the same fields and can be analyzed in a similar manner. Missing information for a field can be represented with a null value. As another example, the data harvester 112 can be adapted to collect only social media posts 162 from one or more server computers 160. The server computers 160 can be associated with an IP address that is different from the IP address associated with the server computers 150.
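The unified structured format described above can be illustrated with a minimal Python sketch. The field names in `UNIFIED_FIELDS`, the function name `to_unified_record`, and the `"crawler"` retrieval mode value are assumptions made for illustration; the description only states that the fields are predefined and that missing information is represented with a null value.

```python
from datetime import datetime, timezone

# Hypothetical set of predefined fields; the actual field list is not specified.
UNIFIED_FIELDS = (
    "user_id", "text", "country_code", "channel_id",
    "retrieval_mode", "collected_at",
)

def to_unified_record(raw, country_code, channel_id, retrieval_mode,
                      collected_at=None):
    """Map raw harvested data onto the predefined fields.

    Fields missing from the raw data are stored as None (a null value),
    and fields that are not predefined are dropped.
    """
    record = {field: raw.get(field) for field in UNIFIED_FIELDS}
    # Annotate the record with information about the collection itself.
    record["country_code"] = country_code
    record["channel_id"] = channel_id
    record["retrieval_mode"] = retrieval_mode
    record["collected_at"] = collected_at or datetime.now(timezone.utc).isoformat()
    return record

example = to_unified_record(
    {"text": "Great phone!", "extra": "not a predefined field"},
    country_code="GB",
    channel_id="FACEBOOK",
    retrieval_mode="crawler",
    collected_at="2015-10-06T00:00:00+00:00",
)
```

Because every record carries the same keys, downstream steps such as consent rule matching can look up a field without first checking whether it exists.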

As another example, the data harvester 113 can be adapted to collect social media data using a social media retrieval service 172 executing on one or more server computers 170. The social media retrieval service 172 can be used to query the pages of one or more social media platforms and to provide social media data matching the query in a structured format. For example, the social media retrieval service 172 can be integrated within a single social media platform. As another example, the social media retrieval service 172 can be provided by a third party, and can be used to retrieve social media from multiple social media platforms.

The social media data can be pushed by or pulled from the social media retrieval service 172 using an Application Programming Interface (API) of the social media retrieval service 172. As a specific example, the data harvester 113 can establish a connection with the social media retrieval service 172. A request can be sent to the social media retrieval service 172. For example, the request can include a search term, an identifier associated with a previous request, and/or credentials for accessing the social media retrieval service 172. The social media retrieval service 172 can generate a stream identifier associated with the request and transmit the stream identifier to the data harvester 113. In one embodiment, the social media retrieval service 172 can push data to the data harvester 113. For example, data matching the search criteria can be transmitted to the data harvester 113 periodically or when a given number of data entries have been harvested. The transmitted data can be identified using the stream identifier. In an alternative embodiment, the data harvester 113 can poll the social media retrieval service 172 using the stream identifier so that data can be pulled from the social media retrieval service 172. In particular, the data can be transmitted from the social media retrieval service 172 in response to a specific request from the data harvester 113.
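The pull (polling) variant can be sketched as follows. `FakeRetrievalService` is a hypothetical in-memory stand-in, not the API of DATASIFT, GNIP, or any other real service; the method names `open_stream` and `poll` and the batch size are likewise assumptions. The sketch only shows how a stream identifier returned for a request can be used to pull matching data in later requests.

```python
import itertools

class FakeRetrievalService:
    """Hypothetical in-memory stand-in for a social media retrieval service."""

    def __init__(self, posts):
        self._posts = posts
        self._streams = {}
        self._ids = itertools.count(1)

    def open_stream(self, search_term):
        # Generate a stream identifier associated with the request.
        stream_id = f"stream-{next(self._ids)}"
        self._streams[stream_id] = [
            post for post in self._posts if search_term.lower() in post.lower()
        ]
        return stream_id

    def poll(self, stream_id, max_items=2):
        # Return the next batch for the stream and remove it from the backlog.
        pending = self._streams[stream_id]
        batch, self._streams[stream_id] = pending[:max_items], pending[max_items:]
        return batch

def pull_all(service, search_term):
    """Poll the service with the stream identifier until no data is left."""
    stream_id = service.open_stream(search_term)
    collected = []
    while True:
        batch = service.poll(stream_id)
        if not batch:
            break
        collected.extend(batch)
    return stream_id, collected
```

In a push arrangement the service would instead call back into the harvester with each batch, tagged with the same stream identifier.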

The collected user-generated data from the data harvesters 111-113 can be processed and classified by the data classifier 115. For example, the data classifier 115 can be software executing on the server computers 110. The data classifier 115 can use the consent rules 120 to determine how to process and store the collected user-generated data. For example, the respective consent rules 120 can specify a format for storing data associated with at least a user identifier, a respective country code, a social media channel, and/or a retrieval method. The data classifier 115 can convert the collected user-generated data into the specified format for storage. The consent rules 120 can be applied to one or more fields of the collected user-generated data using various logical operations. For example, a consent rule can correspond to a given retrieval mode and social media channel. As a specific example, a consent rule can specify a storage format for social media data collected from the social media channel FACEBOOK using the social media retrieval service DATASIFT. Thus, any social media data collected from FACEBOOK using DATASIFT will match the consent rule. As another example, a consent rule can correspond to a given retrieval mode, social media channel, and country code. As a specific example, a consent rule can apply to social media data collected from Great Britain from the social media channel FACEBOOK using the social media retrieval service DATASIFT. Thus, any social media data collected from Great Britain from FACEBOOK using DATASIFT will match the consent rule. The consent rules 120 can be applied in a prioritized order to classify each block of user-generated data associated with a different user. For example, consent rules corresponding to a user can be applied before consent rules corresponding to a country code and/or a social media channel. As another example, a consent rule matching more fields can be applied before a consent rule matching fewer fields of the collected user-generated data.
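The prioritized matching described above can be sketched in Python. The `ConsentRule` class, its field names, and the storage format strings are hypothetical. Combining the two example orderings into one sort key (user-specific rules first, then rules that constrain more fields) is one possible reading, not a stated requirement.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConsentRule:
    """Hypothetical rule: every field that is set must equal the record's value."""
    storage_format: str
    user_id: Optional[str] = None
    country_code: Optional[str] = None
    channel: Optional[str] = None
    retrieval_mode: Optional[str] = None

    def conditions(self):
        candidates = (
            ("user_id", self.user_id),
            ("country_code", self.country_code),
            ("channel", self.channel),
            ("retrieval_mode", self.retrieval_mode),
        )
        return {name: value for name, value in candidates if value is not None}

    def matches(self, record):
        conditions = self.conditions()
        # A rule with no conditions is treated as matching nothing.
        return bool(conditions) and all(
            record.get(name) == value for name, value in conditions.items()
        )

def priority(rule):
    # User-specific rules first, then rules that constrain more fields.
    return (rule.user_id is not None, len(rule.conditions()))

def find_matching_rule(rules, record):
    """Return the highest-priority matching rule, or None if nothing matches."""
    for rule in sorted(rules, key=priority, reverse=True):
        if rule.matches(record):
            return rule
    return None

rules = [
    ConsentRule("non_anonymized", channel="FACEBOOK", retrieval_mode="DATASIFT"),
    ConsentRule("anonymized", country_code="GB", channel="FACEBOOK",
                retrieval_mode="DATASIFT"),
]
gb_post = {"country_code": "GB", "channel": "FACEBOOK", "retrieval_mode": "DATASIFT"}
us_post = {"country_code": "US", "channel": "FACEBOOK", "retrieval_mode": "DATASIFT"}
```

Here the Great Britain post matches both rules, and the three-field rule is applied because it is more specific. A `None` result corresponds to the no-matching-rule case, in which the data would be quarantined.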

Thus, a consent rule can comprise an exemplar retrieval mode, an exemplar social media channel, and a resulting storage format. Collected user data coming from a data harvester can be matched against the exemplar retrieval mode and exemplar social media channel for a given rule. If there is a match, the data is stored in the resulting storage format of the given rule. Retrieval modes can be represented by retrieval mode identifiers, social media channels can be represented by social media channel identifiers, and storage formats can be represented by storage format identifiers. Such an arrangement allows one to easily add additional retrieval modes, social media channels, or storage formats without having to re-code the implementation.

As a specific example, the consent rules 120 can include a “white-list” of users that have provided explicit consent to have their data analyzed. The white-list can include users within the collector's organization, users that have provided consent through terms and conditions of using a social media platform, and users that have provided consent directly to the collector's organization, for example. The consent rules 120 can include a “black-list” of users that have explicitly withheld or withdrawn consent to have their data analyzed. The black-list can also include users that provide less useful information, such as spammers, for example. Each block can be classified according to whether the user identifier associated with the block matches any of the user identifiers included within the white-list or the black-list. When the user identifier does not match any of the user identifiers on the white-list or the black-list, consent rules 120 corresponding to the other fields of the user-generated data, such as the country code, the social media channel, and/or the retrieval method, can be used to classify the user-generated data. For example, a country may have no restrictions on using harvested data, may allow the use of anonymized harvested data, or may require explicit user consent to use the harvested data. As another example, a social media channel (e.g., a website) where user-generated data is posted may have terms and conditions that require the user to consent to having his or her data harvested and analyzed. As another example, the retrieval method (e.g., the data harvesters 111-113) may handle one or more aspects of the consent handling, such as when a social media retrieval service pre-filters data according to user settings.
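The list-based classification can be sketched as below. The function names and the returned labels are hypothetical. Letting the black-list take precedence when a user appears on both lists is an assumption made here (a withdrawal of consent is treated as overriding earlier consent); the description does not address that case.

```python
def classify_by_user(user_id, white_list, black_list):
    """Classify by user identifier; None means the user is on neither list."""
    # Assumption: the black-list wins if a user appears on both lists.
    if user_id in black_list:
        return "black_listed"
    if user_id in white_list:
        return "white_listed"
    return None

def classify(record, white_list, black_list, classify_by_other_fields):
    """Apply the user lists first, then fall back to rules on the other fields."""
    decision = classify_by_user(record.get("user_id"), white_list, black_list)
    if decision is not None:
        return decision
    # Country code, social media channel, and/or retrieval method rules.
    return classify_by_other_fields(record)
```

The `classify_by_other_fields` callable stands in for whatever rule lookup handles the remaining fields, which keeps the user lists independent of how those rules are stored.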

Generally, the collected user-generated data for a particular user can be stored in a record or row of a relational database. The relational database can include the storage 130, for example. The fields of the collected user-generated data can be stored as columns within the row of the relational database. The number and types of fields for the rows can be predefined and can be the same for each record. However, the amount and type of data collected from different respective users may be different. For example, a first piece of collected user-generated data may include the user's name and age, but a second piece of collected user-generated data may include only the user's name. Thus, the second piece of collected user-generated data is missing information related to age. Missing or removed information can be represented by using a null value in the corresponding field of the record.

The consent rules 120 can specify a format for storing the collected user-generated data. For example, the consent rules 120 can specify that the collected user-generated data is to be stored as anonymized data 131 or non-anonymized data 132. The anonymized data 131 can be user-generated data that has one or more aspects of user-identifying information removed from the data before it is stored. For example, the user-identifying information can include a user identifier, a name, an email address, a login name or alias, a phone number, a physical address, a gender, a birthdate or age, a marital status, a government identifier number (such as a social security number), and/or an account number. The consent rules 120 can specify which fields to remove (e.g., the fields in which to store a null value) when the collected user-generated data is to be stored as anonymized data 131. The user-identifying information can be extracted from the collected user-generated data and stored separately as user-identifying data 134. For example, email addresses, user names, and/or phone numbers can be stored as user-identifying data 134. The user-identifying data 134 can be used to generate mailing lists or phone lists, so that consent of the users can be requested, for example.
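A minimal sketch of the anonymization step, assuming records are plain dictionaries and that the matching rule supplies the list of fields to remove. The function name `anonymize` and the sample field names are illustrative only.

```python
def anonymize(record, fields_to_remove):
    """Return (anonymized_record, user_identifying_data).

    Each field named by the rule is replaced with None (a null value) in a
    copy of the record. The removed non-null values are returned separately
    so they can be stored apart from the anonymized data.
    """
    anonymized = dict(record)
    identifying = {}
    for field in fields_to_remove:
        if anonymized.get(field) is not None:
            identifying[field] = anonymized[field]
        if field in anonymized:
            anonymized[field] = None
    return anonymized, identifying

post = {"user_id": "u1", "email": "a@example.com", "age": None, "text": "nice"}
anonymized_post, identifying_data = anonymize(post, ["user_id", "email", "age"])
```

The input record is copied rather than modified, so the caller decides whether the original is discarded after the two outputs are stored.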

When there are no matching consent rules 120 corresponding to the collected user-generated data, the collected user-generated data can be stored as quarantined data 133. For example, the quarantined data 133 can be stored on a storage device that is physically separate from the one holding the anonymized data 131 and the non-anonymized data 132. As another example, the quarantined data 133 can be stored in a locked format, such as by being encrypted. Services and/or applications that are used for analyzing user-generated data can be denied access to the quarantined data 133, such as by being denied read permission or network access to the quarantined data 133, or by not having access to an encryption key for decrypting encrypted quarantined data 133. Storing the user-generated data in the quarantined or locked format can include blocking access to the social media data when there is no matching consent configuration rule. The access can be blocked for all services and applications that can access the storage 130 or for only a particular downstream application. By blocking access to the quarantined data 133, user sentiment associated with the collected user-generated data is not retrievable until the quarantined data 133 is unlocked or removed from quarantine. For example, the quarantined data 133 can be unlocked when a matching consent rule is created. As a specific example, user-generated data can be collected for a given country code that has no corresponding consent rule defined for the given country code. The collected user-generated data can be stored in a locked format. If, at a later time, a consent rule is created that corresponds to the given country code, the collected user-generated data can be unlocked so that applications can analyze the user sentiment associated with the collected user-generated data. Storing the user-generated data in the locked format can include transmitting a notification comprising the country code or the user identifier associated with the user-generated data. For example, the notification can be an email or Short Message Service (SMS) message sent to an administrator of the system 100. Thus, the administrator can be made aware that there is locked data, and the administrator can potentially create a new consent rule to unlock the data.
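The quarantine and later release of data can be sketched with an in-memory store. The `HarvestStore` class is hypothetical, rules are simplified to a mapping from country code to storage format, and the `notify` callable stands in for sending an email or SMS message. Conversion into the storage format, encryption, and physical separation of devices are omitted.

```python
class HarvestStore:
    """Hypothetical in-memory store with a quarantine area."""

    def __init__(self, rules=None, notify=print):
        self.rules = dict(rules or {})   # country code -> storage format
        self.stored = []                 # (storage format, record) pairs
        self.quarantined = []            # records with no matching rule
        self.notify = notify             # stand-in for an email/SMS sender

    def store(self, record):
        country = record.get("country_code")
        storage_format = self.rules.get(country)
        if storage_format is None:
            # No matching rule: lock the data away and alert an administrator.
            self.quarantined.append(record)
            self.notify(f"quarantined data for country code {country!r}")
            return "quarantined"
        self.stored.append((storage_format, record))
        return storage_format

    def add_rule(self, country_code, storage_format):
        """Create a rule and release any quarantined data that it matches."""
        self.rules[country_code] = storage_format
        still_locked = []
        for record in self.quarantined:
            if record.get("country_code") == country_code:
                self.stored.append((storage_format, record))
            else:
                still_locked.append(record)
        self.quarantined = still_locked

    def readable_records(self):
        # Downstream applications only ever see non-quarantined data.
        return [record for _, record in self.stored]
```

A downstream analytics application would call only `readable_records`, so quarantined posts stay invisible to it until `add_rule` releases them.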

Example Dataflow Diagram for Consent Handling During Data Harvesting

FIG. 2 is an example dataflow diagram for consent handling during data harvesting. In particular, a system 200 for harvesting and analyzing social media data can communicate with a social media retrieval service 210 over an HTTP interface. The system 200 can include a data harvesting module or service 205, a database system 250, and social data analytics 260. For example, the data harvesting module or service 205 can be used to manage the collection of social media data, such as formatting requests, receiving asynchronous communications, and formatting social media data for downstream services or applications. In one embodiment, the data harvesting module or service 205 can include a batch job generator 220, a social media retrieval service interface 230, and a real-time service 240.

The database system 250 (e.g., a database management system and the like) can be used as a repository for the storage of consent rules, queries, and transformation rules. For example, the repository can include a set of tables that hold user-created and predefined system objects, source and target metadata, and transformation rules. A user of the system 200 can submit a query using the database system 250. The query can be submitted directly through a user interface of the database system 250 or indirectly using a call from the social data analytics 260. The query can include a search term, a retrieval mode, and one or more social media channels. The query can be processed to generate configuration data for a batch job generator 220. The configuration data can include information specific to the query, such as one or more search terms, and information specific to the retrieval mode and/or a subscriber of the retrieval service. As one example, the query can be queued by the database system 250 so that the query can be launched when resources of the system 200 (e.g., the batch job generator 220) are available.

The batch job generator 220 can be used to initiate harvesting of social media data when the query is read from the head of the queue. In particular, the batch job generator 220 can read the configuration data associated with the query and format the configuration data for consumption by an interface for harvesting social media data (such as a social media retrieval service interface 230). For example, the configuration data can be sent as an eXtensible Markup Language (XML) document containing elements and attributes specifying information for initiating harvesting of social media using the social media retrieval service 210. As a specific example, the XML document can include a Uniform Resource Locator (URL) associated with the social media retrieval service 210, a stream identifier, credentials for accessing the social media retrieval service 210, and/or settings related to one or more aspects of using the social media retrieval service 210.
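As an illustration of such a configuration document, the following Python sketch builds one with the standard library's `xml.etree.ElementTree`. The element and attribute names (`harvestJob`, `serviceUrl`, `streamId`, `credentials`, `searchTerms`), the example URL, and the idea of passing a credentials reference rather than a secret are all assumptions; no schema is specified in the description.

```python
import xml.etree.ElementTree as ET

def build_harvest_config(url, stream_id, search_terms, credentials_ref):
    """Serialize hypothetical harvesting configuration data as an XML string."""
    root = ET.Element("harvestJob")
    ET.SubElement(root, "serviceUrl").text = url
    ET.SubElement(root, "streamId").text = stream_id
    # A reference to stored credentials is used here instead of a raw secret.
    ET.SubElement(root, "credentials", {"ref": credentials_ref})
    terms = ET.SubElement(root, "searchTerms")
    for term in search_terms:
        ET.SubElement(terms, "term").text = term
    return ET.tostring(root, encoding="unicode")

config_xml = build_harvest_config(
    url="https://retrieval.example.com/api",
    stream_id="stream-123",
    search_terms=["acme phone"],
    credentials_ref="vault:harvester",
)
```

The receiving interface can parse the string back with `ET.fromstring` and read each setting by element name.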

The social media retrieval service interface 230 can be used for communication between the data harvesting module or service 205 and the social media retrieval service 210. For example, the social media retrieval service interface 230 can establish a connection to the social media retrieval service 210, transmit HTTP requests to the social media retrieval service 210, and receive social media data from the social media retrieval service 210. As one example, the social media retrieval service interface 230 can poll the social media retrieval service 210 to determine if social media data is ready to be downloaded. As another example, the social media retrieval service interface 230 can receive a message containing social media data from the social media retrieval service 210. The message can include a stream identifier so that the social media retrieval service interface 230 can associate the collected data to the query that requested the data. For example, the social media data can be returned in a JavaScript Object Notation (JSON) format. The social media retrieval service interface 230 can parse the JSON data into different data fields. Fields can be added to indicate additional information related to different aspects associated with the data query and/or the collection of the data. The fields can be reformatted as data within an XML document. The XML document can be communicated to the real-time service 240.
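The JSON-to-XML step can be sketched as follows. The message layout (`stream_id`, `items`), the XML element names, and the `query_by_stream` lookup table are hypothetical; real retrieval services define their own payload formats.

```python
import json
import xml.etree.ElementTree as ET

def json_message_to_xml(message, query_by_stream):
    """Parse a JSON message into fields and re-emit them as an XML string.

    The stream identifier in the message is used to look up the query that
    requested the data, and that query identifier is added as an annotation.
    """
    payload = json.loads(message)
    stream_id = payload["stream_id"]
    root = ET.Element(
        "socialMediaData",
        {"streamId": stream_id, "queryId": query_by_stream[stream_id]},
    )
    for item in payload["items"]:
        post = ET.SubElement(root, "post")
        for name, value in item.items():
            field = ET.SubElement(post, "field", {"name": name})
            # Null values become empty elements.
            field.text = None if value is None else str(value)
    return ET.tostring(root, encoding="unicode")

message = json.dumps({
    "stream_id": "s1",
    "items": [{"user_id": "u1", "text": "hi", "country_code": "GB"}],
})
xml_doc = json_message_to_xml(message, {"s1": "q42"})
```

Keeping field names as attribute values rather than element names means arbitrary JSON keys do not have to be valid XML tag names.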

The real-time service 240 can transform the XML data and push the social media data into data models of the database system 250. The database system 250 can include a data store for storing the collected social media data according to the consent rules. For example, the social media data can be formatted in an anonymous, non-anonymous, or locked format based on the format specified by a matching consent rule. If there is no matching consent rule, the social media data can be formatted in the locked format. The data store can provide a connection to downstream applications, such as the social data analytics 260, and backend databases. Thus, a connection can be created between data services and web services.

Example of Collected Social Media Data

FIG. 3 is a block diagram of an example of collected social media data 300. The social media data 300 can include content that is associated with an identity of a user, content that is generated by the user, content that can be derived from the social media channel, and content that is associated with the collection of the data.

Content that is associated with an identity of a user can include user identifying information 310, for example. The user identifying information 310 can include a user identifier, a name, an email address, a login name or alias, a phone number, a physical address, a gender, a birthdate or age, a marital status, a government identifier number (such as a social security number), an occupation, an image of the user, a homepage, an image of an avatar of the user, and/or an account number. Anonymizing social media data can include removing all or some of the user identifying information 310 from the data. Non-anonymous social data can include keeping all or some of the user identifying information 310 present in the original post. For example, redundant, contradictory, or immaterial information can be removed. The non-anonymous social data can include annotating additional information to the user identifying information 310. For example, a database of user identifying information can be maintained, and missing fields of the user identifying information 310 can potentially be added by searching the database using known fields of the user identifying information 310. Thus, the user identifying information 310 can include information that is obtained from content that is generated by the user, content that is derived from the social media channel, and/or content that is obtained externally from the social media channel (such as by a database of users).

Content that is generated by the user can include user-generated content 320, for example. The user-generated content 320 can include text, audio, video, hyperlinks, status or sentiment indicators (such as likes and dislikes), and tags indicating a subject-matter of the content. The user-generated content 320 can be modified, edited, and/or annotated prior to storage. For example, a video or audio file can be transcribed using automated methods to reduce the storage size and to potentially make the content easier to analyze. As another example, sentiments can be mined from user-generated text. In particular, different keywords and/or punctuation can be assigned different values to indicate a level of user sentiment. The sentiment level can be added to the user-generated content 320.
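A toy sketch of keyword-and-punctuation sentiment scoring follows. The keyword table, the scores, and the rule that an exclamation mark strengthens a non-neutral score by one are invented for illustration; the description only says that keywords and punctuation can be assigned values.

```python
import re

# Hypothetical keyword values; a real system would use a much larger lexicon.
KEYWORD_SCORES = {
    "great": 2, "love": 2, "good": 1,
    "bad": -1, "poor": -1, "terrible": -2,
}
EXCLAMATION_BONUS = 1  # assumed weight for emphatic punctuation

def sentiment_level(text):
    """Sum the keyword values, then let '!' strengthen a non-neutral score."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(KEYWORD_SCORES.get(word, 0) for word in words)
    if "!" in text and score != 0:
        score += EXCLAMATION_BONUS if score > 0 else -EXCLAMATION_BONUS
    return score

def annotate_with_sentiment(content):
    """Return a copy of the content with a 'sentiment_level' field added."""
    annotated = dict(content)
    annotated["sentiment_level"] = sentiment_level(content.get("text") or "")
    return annotated
```

Storing the score alongside the text lets downstream analytics aggregate sentiment without re-parsing every post.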

Content that can be derived from the social media channel can include a social media channel identifier 330 and a country code 340, for example. The social media channel identifier 330 can identify which social media channel the information was obtained from. The country code 340 can indicate the country where the user-generated data was created or the country where the user registered to use the social media platform, for example. As a specific example, the country code can be encoded as the two-character International Organization for Standardization (ISO) code in accordance with ISO 3166. Other content that may be derived from the social media channel includes a language of the content, a channel identifier associated with the content, a creation time, a type associated with the content, a URL associated with the content, a number of views of the content, a number of positive votes for the content, a number of negative votes for the content, a number of contacts associated with the user, a popularity rank of the user, and a location, latitude, and/or longitude associated with the content.

Content that is associated with the collection of the data can include a retrieval mode 350, a retrieval timestamp, a query identifier, a user associated with a query, a status of the data (such as locked, anonymized, or non-anonymized), a stream identifier, and one or more search terms, for example. The retrieval mode 350 can indicate which data harvester of a plurality of data harvesters was used to collect the data. Other annotated data fields 360 can be any information related to the user, social media channel, collection, identification, and/or analysis of the social media data 300. For example, the other annotated data fields 360 can be added by the data harvester, database system, and/or social data analytics engine.

Example Methods for Managing User Consent During Data Harvesting

FIGS. 4-5 are flow charts illustrating various example methods for managing user consent during data harvesting. Specifically, FIG. 4 is a flow chart illustrating an example method 400 for managing user consent during data harvesting.

At 410, social media data associated with a user is collected. For example, the social media data can be collected using a data harvesting program, such as data harvesters 111-113. The data harvesting program can collect the social media data by crawling pages associated with a social media channel, or the data harvesting program can collect the social media data by using a social media retrieval service, for example. Collecting the social media data can include parsing the social media data and dividing the data into different fields and annotating the social media data with additional information.

At 420, it can be determined whether the user associated with the social media data is on a black-list of users. For example, the black-list of users can include users that have explicitly withheld or withdrawn consent to have their data analyzed. If the user associated with the social media data is on the black-list of users, then at 430, the collected social media data can be deleted. If the user associated with the social media data is not on the black-list of users, then at 440, it can be determined whether the user associated with the social media data is on a white-list of users. For example, the white-list of users can include users that have provided explicit consent to have their data analyzed.

If the user associated with the social media data is on the white-list of users, then at 450, the collected social media data can be stored in an anonymized or non-anonymized format based on a level of user consent. For example, a consent level can be associated with a respective user on the white-list of users. The consent level can indicate whether to store collected social media data associated with the user in an anonymous or non-anonymous format. The level of consent can be explicitly provided by the user or may be derived from a user agreement or laws of a country where the social media data was generated. In an alternative embodiment, if the user associated with the social media data is on the white-list of users, then the collected social media data can always be stored in a non-anonymized format.

If the user associated with the social media data is not on the black-list or the white-list of users, then at 500, an additional analysis can be performed before storing the collected social media data. In one embodiment, the comparison to the white-list and/or the additional analysis can be optional. For example, a setting can be provided to enable or disable the additional analysis. By disabling the additional analysis, the overhead of the processing of social media data can be reduced. When the additional analysis is disabled, the social media data can be stored in an anonymous or non-anonymous format.
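The decision flow of method 400 can be sketched as below. The function name, the returned action strings, the representation of the white-list as a mapping from user identifier to consent level, and the `default_format` used when the additional analysis is disabled are all assumptions made for this sketch.

```python
def triage(record, black_list, white_list,
           analysis_enabled=True, default_format="anonymous"):
    """Decide what to do with a collected record (steps 420-450 and 500).

    black_list: set of user identifiers that withheld or withdrew consent.
    white_list: dict mapping user identifier to a consent level, here
                assumed to be "anonymous" or "non-anonymous".
    """
    user = record["user_id"]
    if user in black_list:            # 420 -> 430: delete collected data
        return "delete"
    if user in white_list:            # 440 -> 450: store per consent level
        return "store:" + white_list[user]
    if analysis_enabled:              # hand over to the additional analysis
        return "additional-analysis"
    # Additional analysis disabled: store in a configured format.
    return "store:" + default_format
```

Checking the black-list first means that a user who appears on both lists is treated as having withdrawn consent, which matches the order of steps 420 and 440.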

FIG. 5 is a flow chart illustrating an example method 500 for performing the additional analysis. At 510, a country code, social media channel, and/or retrieval mode associated with the social media data can be determined. At 520, it can be determined whether there is a matching consent rule for the country code, social media channel, and/or retrieval mode. A consent rule matches when value(s) of a country code, social media channel, and/or retrieval mode associated with the consent rule match values of corresponding fields of the social media data. The consent rules can be prioritized so that only one consent rule is selected when multiple consent rules match. As one example, consent rules for countries can be prioritized over consent rules for social media channels. The consent rules can be associated with multiple fields of the social media data. For example, a consent rule can be associated with all of a given country code, social media channel, and retrieval mode. As another example, a consent rule can be associated with a given country code and social media channel.
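One possible way to express rule matching and prioritization at 520 is sketched below. The `ConsentRule` class, its field names, and the specific priority ordering (country, then channel, then retrieval mode) are assumptions; the description only states that country rules can be prioritized over channel rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConsentRule:
    """Hypothetical rule record; a field left as None is not constrained."""
    store_format: str                     # e.g. "anonymous" or "non-anonymous"
    country_code: Optional[str] = None
    channel: Optional[str] = None
    retrieval_mode: Optional[str] = None

    def matches(self, record):
        pairs = [
            (self.country_code, record.get("country_code")),
            (self.channel, record.get("channel")),
            (self.retrieval_mode, record.get("retrieval_mode")),
        ]
        # Every field the rule specifies must equal the record's field.
        return all(want is None or want == have for want, have in pairs)

    def priority(self):
        # Tuples compare left to right, so a rule naming a country
        # outranks a rule that names only a channel.
        return (self.country_code is not None,
                self.channel is not None,
                self.retrieval_mode is not None)


def select_rule(rules, record):
    """Return the single highest-priority matching rule, or None."""
    candidates = [rule for rule in rules if rule.matches(record)]
    return max(candidates, key=ConsentRule.priority, default=None)
```

A return value of None corresponds to the case in which no consent rule matches.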

When there is no matching consent rule, at 530, the collected social media data can be stored in quarantine. Storing the collected social media data in quarantine can include storing the quarantined data in a limited-access physically separate storage device and/or storing the quarantined data in a locked format. For example, storing the user-generated data in quarantine can include blocking access to the social media data from one or more downstream services and/or applications. By blocking access to the locked data, the user identifying information and user-generated content associated with the collected social media data is not retrievable until the locked data is unlocked. For example, the locked data can be unlocked when a matching consent rule is created. Storing the social media data in quarantine can include writing to a log file and/or transmitting a notification comprising the country code or the user identifier associated with the user-generated data. Thus, the administrator can be made aware that there is locked data, and the administrator can create a new consent rule to unlock the data.
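The quarantine behavior at 530 might be modeled as in the following sketch. The `Quarantine` class, its method names, and the use of Python's standard `logging` module for the log entry are assumptions; an in-memory list stands in for the limited-access storage device.

```python
import logging

logger = logging.getLogger("consent.quarantine")


class Quarantine:
    """In-memory stand-in for a limited-access quarantine store."""

    def __init__(self):
        self._locked = []

    def __len__(self):
        return len(self._locked)

    def add(self, record):
        """Lock the record and log the country code and user identifier."""
        self._locked.append(dict(record, status="locked"))
        logger.warning(
            "quarantined data: country_code=%s user_id=%s",
            record.get("country_code"), record.get("user_id"),
        )

    def read(self, index):
        """Downstream reads are always refused while data is locked."""
        raise PermissionError("social media data is locked")

    def release(self, rule_matches):
        """Unlock and return the records matched by a newly created rule.

        rule_matches is a callable taking a record and returning a bool.
        """
        released = [r for r in self._locked if rule_matches(r)]
        self._locked = [r for r in self._locked if not rule_matches(r)]
        return released
```

Calling `release` after an administrator creates a new consent rule corresponds to unlocking the locked data.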

When there is a matching consent rule, at 540, it can be determined whether personal data associated with the collected social media data can be retained. For example, the matching consent rule can specify whether to anonymize the data or whether to store the data non-anonymized.

When the matching consent rule specifies that personal data cannot be retained, at 550, anonymized social media data can be stored. The matching consent rule can specify how to anonymize the data, such as by indicating which fields of the data to delete before storing the data.

When the matching consent rule specifies that personal data can be retained, at 560, non-anonymized social media data can be stored. The personal data can be obtained from the original collected social media data and/or annotated with personal data obtained from other databases, for example.

At 570, user contact information associated with the social media data can optionally be stored separately from the social media data. By separating the social media data from the user contact information, the sentiments of the user can be hidden. The user contact information can potentially be used to request explicit consent from the user to analyze his or her data. Thus, the user can potentially be added to the black-list or the white-list depending on the user's response.

Example Method for Processing a Request for Social Media Data

FIG. 6 is a flow chart illustrating an example method 600 for processing a request for social media data. For example, the request can be from a downstream application, such as a social media data analytics service. As another example, the request can be in response to adding a new consent rule so that previously quarantined data matching the new consent rule can be made available to applications that may use the data. The request can include one or more search terms and fields for identifying the desired social media data. The request can include which fields to return in a response.

When social media data matching the request is present, at 620, it can be determined whether the social media data is quarantined. For example, it can be determined whether the social media data is stored in a locked format. When the social media data is not quarantined, at 630, the social media data can be returned. All fields of the social media data can be returned or a subset of the fields of the social media data can be returned, based on the requested information. When the social media data is quarantined, at 640, it can be determined whether there is a matching consent rule corresponding to the social media data. For example, a matching consent rule may have been added after the social media data was collected and stored in quarantine. As one example, a matching consent rule may have been added for the country code, social media channel, and/or retrieval mode. As another example, the user associated with the quarantined social media data may have been added to the white-list.

At 650, when there is a matching consent rule corresponding to the social media data, one or more fields of the social media data can be returned in a format according to the matching rule. For example, the matching rule may specify that the social media data is to be anonymized. Thus, user-identifying information can be removed when returning the formatted social media data to the requestor. At 660, the record corresponding to the social media data can optionally be moved and/or reformatted according to the matching consent rule. For example, quarantined data stored in a storage device separate from non-quarantined data can be extracted from the storage device associated with quarantined data, and stored on the storage device associated with non-quarantined data. Extracting the quarantined data can include decrypting the quarantined data and generating a new data file by copying the unencrypted data into the new data file. The previously quarantined data can be deleted from the quarantine storage device. Thus, the record can be updated to reflect the current state of the consent rule and a conversion can be skipped the next time that the record is accessed.

At 670, when the social media data is locked and there is no matching consent rule, access to the social media data is denied and an indication of the denial can be sent to the requestor. For example, a response including an error code can be sent to the requestor. Thus, the access to the collected social media data is blocked and the user sentiment contained within the collected social media data is not retrievable until a matching consent rule is created.
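The request handling at 620 through 670 can be sketched as a single function. The `respond` name, the `"locked"` status value, the `"ACCESS_DENIED"` error code, the response shape, and the `find_rule` callable (assumed to return a format string or None) are hypothetical choices made for this sketch.

```python
def respond(record, find_rule, requested_fields=None,
            identifying_fields=("user_id", "name")):
    """Build a response for one record that matched a request.

    find_rule: callable returning "anonymous", "non-anonymous", or None
               when no consent rule matches the record.
    """
    if record.get("status") == "locked":              # 620: quarantined?
        rule_format = find_rule(record)               # 640: matching rule?
        if rule_format is None:
            return {"error": "ACCESS_DENIED"}         # 670: deny access
        if rule_format == "anonymous":                # 650: format per rule
            record = {k: v for k, v in record.items()
                      if k not in identifying_fields}
    # 630/650: return all fields or only the requested subset.
    fields = requested_fields if requested_fields is not None else list(record)
    return {"data": {k: record[k] for k in fields if k in record}}
```

The optional move or reformat of the record at 660 is left out of the sketch, since it only affects where the record is kept and not the response that is returned.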

Additional Example Method for Managing User Consent During Data Harvesting

FIG. 7 is a flow chart illustrating an example method 700 for managing user consent during data harvesting. At 710, user-generated data, such as social media data, is harvested. For example, one or more data harvesting programs (such as data harvesters 111-113, 205) can be used to collect structured or unstructured social media data from various social media platforms. The data can be harvested according to a query comprising various search criteria, for example. The social media data can be divided into a plurality of fields including fields representative of a user identifier, a country code, and/or a social media channel. The social media data can be annotated with additional fields representative of content that is associated with the identity of a user, content that can be derived from the social media channel, and content that is associated with the collection of the data, for example.

At 720, consent rules associated with a country code and/or an information channel can be stored. The information channel can be a social media channel, for example. The respective consent rules can specify a format for storing data associated with at least a respective country code or a respective information channel. For example, the specified formats can include an anonymous format, a non-anonymous format, and a locked format. When the anonymous format is used, fields to exclude (e.g., fields where null values are stored) can be specified by the consent rule. Thus, an author of the consent rule can potentially make a determination as to which fields are to be excluded when anonymizing the user-generated data. Furthermore, the anonymization can be tuned on a country-by-country basis.

At 730, it can be determined whether there is a matching consent rule corresponding to the country code and/or information channel of the harvested user-generated data. For example, consent rules can be stored for Russia (e.g., the country code is RU) and for Great Britain (e.g., the country code is GB). If the user-generated data is harvested from a web-site operating from Russia with Russian users, the Russian consent rule will match and the Great Britain consent rule will not match. If the user-generated data is harvested from a web-site operating from Australia with Australian users, neither the Russian consent rule nor the Great Britain consent rule will match.

At 740, when there is a matching consent rule, the collected user-generated data can be stored according to the matching consent rule. For example, the consent rule for Russia can specify to store the harvested user-generated data in an anonymous format. Thus, when user-generated data is harvested from a Russian user, all of the user-identifiable fields can be removed from the harvested user-generated data as it is stored. Alternatively, the consent rule for Russia can specify specific fields to remove, such as a name, an address, and a telephone number. In this example, other fields, such as gender, age, and an email address can be retained while the name, address, and telephone number fields are removed.

At 750, the collected user-generated data can be stored in a quarantine when there is no matching consent rule. Storing the user-generated data in the quarantine can include blocking access to the user-generated data when there is no matching consent rule. For example, the user-generated data can be stored in a limited-access storage device and/or the user-generated data can be encoded or encrypted in a locked format that is only readable by a limited number of applications. For example, access to the collected user-generated data by a downstream application, such as a data analytics program, can be blocked. Storing the user-generated data in the quarantine can include logging information and/or transmitting a notification comprising the country code or the user identifier associated with the user-generated data. For example, if Australian user-generated data is collected and there is no matching consent rule, the Australian user-generated data can be stored in a quarantine and a notification to an administrator can indicate that Australian data has been collected and quarantined. Thus, the administrator can generate a consent rule for the Australia country code so that the Australian data can be removed from the quarantine.
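The RU, GB, and AU example of steps 730 through 750 can be walked through with the sketch below. The rule table, the field names, and the choice of a non-anonymous format for GB are assumptions for illustration only; excluded fields are stored as null values (None), which is one of the options mentioned at 720.

```python
# Hypothetical rule table keyed by ISO 3166 two-character country code.
CONSENT_RULES = {
    "RU": {"format": "anonymous", "exclude": ["name", "address", "phone"]},
    "GB": {"format": "non-anonymous", "exclude": []},  # assumed for the sketch
}


def store_harvested(record, rules, storage, quarantine, notify):
    """Store a harvested record per its country rule, or quarantine it."""
    code = record["country_code"]
    rule = rules.get(code)                     # 730: look for a matching rule
    if rule is None:                           # 750: no rule, so quarantine
        quarantine.append(record)
        notify(f"No consent rule for country code {code}; data quarantined")
        return "quarantined"
    stored = dict(record)
    if rule["format"] == "anonymous":          # 740: store per matching rule
        for field in rule["exclude"]:
            if field in stored:
                stored[field] = None           # null value marks exclusion
    storage.append(stored)
    return rule["format"]
```

With these assumed rules, a Russian record keeps its gender, age, and email address while its name, address, and telephone number are nulled, and an Australian record is quarantined with a notification naming the AU country code.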

Example Computing Environment

FIG. 8 depicts a generalized example of a suitable computing environment (e.g., computing system) 800 in which the described innovations may be implemented. The computing environment 800 is not intended to suggest any limitation as to scope of use or functionality, as the innovations may be implemented in diverse general-purpose or special-purpose computing systems. For example, the computing environment 800 can be any of a variety of computing devices (e.g., desktop computer, laptop computer, server computer, tablet computer, etc.).

With reference to FIG. 8, the computing environment 800 includes one or more processing units 810, 815 and memory 820, 825. In FIG. 8, this basic configuration 830 is included within a dashed line. The processing units 810, 815 execute computer-executable instructions. A processing unit can be a general-purpose central processing unit (CPU), processor in an application-specific integrated circuit (ASIC) or any other type of processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. For example, FIG. 8 shows a central processing unit 810 as well as a graphics processing unit or co-processing unit 815. The tangible memory 820, 825 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two, accessible by the processing unit(s). The memory 820, 825 stores software 880 implementing one or more innovations described herein, in the form of computer-executable instructions suitable for execution by the processing unit(s).

A computing system may have additional features. For example, the computing environment 800 includes storage 840, one or more input devices 850, one or more output devices 860, and one or more communication connections 870. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing environment 800. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 800, and coordinates activities of the components of the computing environment 800.

The tangible storage 840 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, or any other medium which can be used to store information in a non-transitory way and which can be accessed within the computing environment 800. The storage 840 stores instructions for the software 880 implementing one or more innovations described herein. For example, the rules engine and others described herein can be the software 880 executed from the memory 820.

The input device(s) 850 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing environment 800. The output device(s) 860 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing environment 800.

The communication connection(s) 870 enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can use an electrical, optical, RF, or other carrier.

Although direct connection between computer systems is shown in some examples, in practice, components can be arbitrarily coupled via a network that coordinates communication.

Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed methods can be used in conjunction with other methods.

Any of the disclosed methods can be implemented as computer-executable instructions stored on one or more computer-readable storage media (e.g., one or more optical media discs, volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as flash memory or hard drives)) and executed on a computer (e.g., any commercially available computer, including smart phones or other mobile devices that include computing hardware). The term computer-readable storage media does not include communication connections, such as signals and carrier waves. Any of the computer-executable instructions for implementing the disclosed techniques as well as any data created and used during implementation of the disclosed embodiments can be stored on one or more computer-readable storage media. The computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application). Such software can be executed, for example, on a single local computer (e.g., any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.

For clarity, only certain selected aspects of the software-based implementations are described. Other details that are well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any specific computer language or program. For instance, the disclosed technology can be implemented by software written in C++, Java, Perl, JavaScript, Adobe Flash, or any other suitable programming language. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well known and need not be set forth in detail in this disclosure.

It should also be well understood that any functionality described herein can be performed, at least in part, by one or more hardware logic components, instead of software. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-On-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

Furthermore, any of the software-based embodiments (comprising, for example, computer-executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.

The disclosed methods, apparatus, and systems should not be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and nonobvious features and aspects of the various disclosed embodiments, alone and in various combinations and sub-combinations with one another. The disclosed methods, apparatus, and systems are not limited to any specific aspect or feature or combination thereof, nor do the disclosed embodiments require that any one or more specific advantages be present or problems be solved.

Non-Transitory Computer-Readable Media

Any of the computer-readable media herein can be non-transitory (e.g., memory, magnetic storage, optical storage, solid-state drives, or the like).

Storing in Computer-Readable Media

Any of the storing actions described herein can be implemented by storing in one or more computer-readable media (e.g., computer-readable storage media or other tangible media).

Any of the things described as stored can be stored in one or more computer-readable media (e.g., computer-readable storage media or other tangible media).

Methods in Computer-Readable Media

Any of the methods described herein can be implemented by computer-executable instructions in (e.g., encoded on) one or more computer-readable media (e.g., computer-readable storage media or other tangible media). Such instructions can cause a computer to perform the method. The technologies described herein can be implemented in a variety of programming languages.

Methods in Computer-Readable Storage Devices

Any of the methods described herein can be implemented by computer-executable instructions stored in one or more computer-readable storage devices (e.g., memory, magnetic storage, optical storage, solid-state drives, or the like). Such instructions can cause a computer to perform the method.

Alternatives

The technologies from any example can be combined with the technologies described in any one or more of the other examples. In view of the many possible embodiments to which the principles of the disclosed technology may be applied, it should be recognized that the illustrated embodiments are examples of the disclosed technology and should not be taken as a limitation on the scope of the disclosed technology. Rather, the scope of the disclosed technology includes what is covered by the following claims. We therefore claim as our invention all that comes within the scope and spirit of the claims.

We claim:
 1. One or more computer-readable storage media comprising computer-executable instructions for a processor, that when executed, cause the processor to: receive social media data associated with a user identifier and a first country code; store a consent configuration rule specifying whether to store the social media data anonymously or non-anonymously, the consent configuration rule associated with a second country code; determine whether the second country code associated with the consent configuration rule matches the first country code associated with the social media data; store the social media data in a quarantine when the second country code associated with the consent configuration rule does not match the first country code associated with the social media data; determine whether the user identifier is on a white-list and store the social media data in non-quarantine storage when the user identifier is on the white-list; and determine whether the user identifier is on a black-list and delete the social media data when the user identifier is on the black-list.
 2. The one or more computer-readable storage media of claim 1, further comprising computer-executable instructions for the processor, that when executed, cause the processor to: when the second country code associated with the consent configuration rule matches the first country code associated with the social media data, store the social media data according to the consent configuration rule.
 3. The one or more computer-readable storage media of claim 2, wherein the consent configuration rule specifies storing the social media data anonymously, and storing the social media data anonymously comprises removing the user identifier from the stored social media data.
 4. The one or more computer-readable storage media of claim 1, wherein storing the social media data in the quarantine comprises blocking access to the social media data when the second country code associated with the consent configuration rule does not match the first country code associated with the social media data.
 5. The one or more computer-readable storage media of claim 1, wherein storing the social media data in the quarantine comprises transmitting a notification comprising the first country code associated with the social media data.
 6. The one or more computer-readable storage media of claim 1, further comprising computer-executable instructions for the processor, that when executed, cause the processor to: unlock the social media data stored in the quarantine in response to adding a new consent configuration rule associated with a third country code that matches the first country code associated with the social media data stored in the quarantine.
 7. The one or more computer-readable storage media of claim 1, further comprising computer-executable instructions for the processor, that when executed, cause the processor to: transmit a request comprising a stream identifier associated with one or more search terms to a social media retrieval service, and wherein the received social media data is associated with at least the stream identifier.
 8. A method implemented at least in part by a computing system, the method comprising: harvesting social media data to generate harvested social media data, the harvested social media data comprising a social media channel, user identifying information, and a country code; storing a plurality of consent rules, a respective consent rule specifying a format for storing data associated with at least a respective country code or a respective social media channel; determining whether there is a matching consent rule of the plurality of consent rules corresponding to the social media channel or the country code of the harvested social media data; when there is a matching consent rule, storing the harvested social media data in the format specified by the matching consent rule of the plurality of consent rules; determining whether the user identified by the user identifying information is on a white-list and wherein the harvested social media data is stored in non-quarantine storage when the identified user is on the white-list; and determining whether a user identified from the user identifying information is on a black-list and delete the harvested social media data when the identified user is on the black-list.
 9. The method of claim 8, wherein the format for storing the data associated with at least the respective country code or the respective social media channel is anonymous, non-anonymous, or locked.
 10. The method of claim 9, wherein the user identifying information of the harvested social media data is removed before storing the harvested social media data when the format specified by the matching consent rule is anonymous.
 11. The method of claim 8, further comprising: when there is no matching consent rule, storing the harvested social media data in a locked format so that user sentiment is not retrievable until a matching consent rule is created.
 12. The method of claim 8, wherein harvesting social media data comprises transmitting one or more search terms to a social media retrieval service and receiving structured data from the social media retrieval service.
 13. The method of claim 8, wherein harvesting social media data comprises retrieving unstructured data matching one or more search terms and formatting the unstructured data into a unified format.
 14. The method of claim 8, wherein a respective consent rule of the plurality of consent rules specifies one or more fields to remove when anonymizing the social media data.
 15. A system for consent handling during data harvesting, the system comprising: a data harvester for collecting social media data, the collected social media data comprising one or more fields representative of a user identifier, a country code, or a social media channel; a first storage device for storing a white list and a plurality of consent rules, a respective consent rule for matching a value of a given field of the collected social media data to a format for storing the social media data; a non-quarantine storage device for storing collected social media data; a quarantine storage device for storing collected social media data, the quarantine storage device physically separate from the non-quarantine storage device; and a database system in communication with the data harvester and the storage devices, the database system configured to: determine whether the user identifier of the collected social media data is on the white-list and store the collected social media data in the non-quarantine storage when the user identifier is on the white-list; determine whether the user identifier of the collected social media data is on a black-list and delete the collected social media data when the user identifier is on the black-list; determine whether any of the plurality of consent rules match the value of the given field of the collected social media data; when there is a match, store the collected social media data on the non-quarantine storage device in the format of the matching consent rule; and when there is no match, store the collected social media data on the quarantine storage device.
 16. The system of claim 15, wherein the data harvester is configured to: transmit a request comprising a stream identifier associated with one or more search terms to a social media retrieval service, and wherein the collected social media data is associated with at least the stream identifier.
 17. The system of claim 15, wherein storing the collected social media data in the format of the matching consent rule comprises removing the user identifier from the collected social media data.
 18. The system of claim 15, wherein storing the collected social media data on the quarantine storage device comprises storing the collected social media data in a locked format so that access to the collected social media data is blocked when there is no matching consent rule.